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SUMMARY 

Inference for Dirichlet process hierarchical models is typically performed using Markov 
chain Monte Carlo methods, which can be roughly categorised into marginal and condi- 
tional methods. The former integrate out analytically the infinite-dimensional compo- 
nent of the hierarchical model and sample from the marginal distribution of the remaining 
variables using the Gibbs sampler. Conditional methods impute the Dirichlet process 
and update it as a component of the Gibbs sampler. Since this requires imputation 
of an infinite-dimensional process, implementation of the conditional method has relied 
on finite approximations. In this paper we show how to avoid such approximations 
by designing two novel Markov chain Monte Carlo algorithms which sample from the 
exact posterior distribution of quantities of interest. The approximations are avoided 
by the new technique of retrospective sampling. We also show how the algorithms can 
obtain samples from functionals of the Dirichlet process. The marginal and the condi- 
tional methods are compared and a careful simulation study is included, which involves 
a non-conjugate model, different datasets and prior specifications. 

Some keywords: Exact simulation; Mixture models; Label switching; Retrospective sam- 
pling; Stick-breaking prior 

1. Introduction 



Dirichlet process hierarchical models, also known as Dirichlet mixture mode 



standard in semiparametric inference; applications include density estimation ([Muller et al 



s, are now 



1 



ance 



Miiller et al 



20031 ). meta-analysis ( 



19961 ) , survival analysis (jGelfand &: Kottasl . 120031 ) . semi-parame tric analysis of vari 



20051') . cluster analysis and partition modellin g (IQuintana. k. Iglesiaa . 



Burr &: Doss 



Teh et al 



20061). The 



20051 ) and machine learning 
hierarchical structure is as follows. Let f{y \ z, A) be a parametric density with param- 
eters z and A, let Hq be a distribution indexed by some parameters and let Be(a, (3) 
denote the beta distribution with parameters a, (5. Then we have 



Yi I {Z,K,\) ~ fiXi I Zk^.X), i = l,...,n 

oo 

Ki\p ~ ^Pj^j{-) 

|e - He, j = l,2,... (1) 
Pi = Vi, pj = (1 - Vi)il -V2)---{1- Vj^i)Vj, J > 2 
Vj ^ Be(l,a). 

Here V = (Vi, V2, ■ ■ ■) and Z = {Zi, Z2, ■ ■ ■) are vectors of independent variables and are 
independent of each other, the KiS are independent given p = {pi,P2, ■ ■ ■), and the l^s are 
independent given Z and K = {Ki,K2, . . . , Kn). Therefore, ([T]) defines a mixture model 
in which f{y j z, A) is mixed with respect to the discrete random probability measure 
P{dz), where 

00 

P{-) = J2p^Sz,{-), (2) 

i=i 

and 5x{-) denotes the Dirac delta measure centred at x. Note that, the discreteness of 
P implies a distributional clustering of the data. The last three lines in the hierarchy 
identify P with the Dirichlet pr ocess with base me asure Hq and concentration parameter 
a; this representation is due to ISethuramanI (119941 ). For the construction of the Dirichlet 



process prior see Ferguson (1973,1974), for properties of Dirichlet mixture models see 
for example Lo (1984), Ferguson (1983), an d Antoniak (1974), and for a recent arti- 
cle on modelling with Dirichlet processes see I Green &: RichardsonI (|200ll ) . Using more 



general beta distributions in the last line gives rise to the rich class of stick-breaking ran 



dom measures and the general hierarchical framework introduced by 



Ishwaran &: James 
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(|200ll ). We have consciously been vague about the spaces on which Yi and Zj hve; it 
should be obvious from the construction that a great deal of flexibility is allowed. 

We refer to Zj and pj as the parameters and the weight respectively of the jth 
component in the mixture, and to Ki as the allocation variable, which determines to 
which component the ith data point is allocated. Note that the prior on the component 
weights Pj imposes a weak identifiability on the mixture components, in the sense that 
E{pj) > E{pi) for any j > I. Throughout the paper we use the nomenclature 'cluster' 
to refer to a mixture component with at least one datum allocated to it. The hyperpa- 
rameters a, Q and A will be consider fixed until §3-71 We will concentrate on inference 
for the allocation variables, the component parameters and weights, and P itself. 

Inference for Dirichlet mixture models has been made feasible by Gibbs sampling 
techniques which have been developed since the seminal work in an unpublished 1988 
Yale University Ph. D. thesis by M. Esco bar. Alternative Monte Carlo schemes d o 



exist and include the sequential samplers of 



Lhil 11993) and 



Ishwaran Sz Jamed ()2003l ) 



and referred to a s the blocked Gibbs sampler in th ose papers, the parti c 



Fearnhead 



Jain Sz Neal 



( 20041') and the reversible jump method of 



Green RichardsonI (120011 ) and 



e filter of 



(|2004l ). We concentrate on Gibbs sampling and related componentwise 
updating algorithms. 

Broadly speaking, there are two possible Gibbs sampling strategies, corresponding 
to two different data augmentation schemes. The marginal method exploits convenient 
mathematical properties of the Dirichlet process and integrates out analytically the ran- 
dom probabilities p from the hierarchy. Then, using a Gibbs sampler, it obtains samples 
from the posterior distribution of the -fCj's and the parameters and the weights of the clus- 
ters. Integrating out p induces prior dependence among the i^j's, and makes the labels 
of the clusters unidentifiable. This approach is easily carried out for conjugate models, 
where further analytic integrations are possible; ([T]) is said to be conjugate when HQ{dz) 
and f{y \ z. A) form a conjugate pair for z. Implementati on of the marginal method fo r 



non-conjugate models is considerably more complicated. 



MacEachern Miillei 



(119981) 
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and iNeall ([20001) are the state-of-the-art implementations in this context. 

The conditional method works with the augmentation scheme described in ([T]). It 
consists of the imputation of p and Z, and subsequent Gibbs sampling from the joint 
posterior distribution of {K,p, Z). The conditional independence structure created with 
the imputation of {p, Z) assists the simultaneous updating of large subsets of t he vari- 



ables. The conditional method was introduced in 
further developed in 



Ishwaran &: James (2001 



Ishwaran fc Zarepouij (j2000l ). it was 



20031 ) and it has two considerable ad- 



vantages over the marginal method. First, it does not rely on being able to integrate 
out analytically components of the hierarchical model, and therefore it is much more 
flexible for current and future elaborations of the basic model. Such extensions include 
more general stick-breaking ran dom measures, and rn odelling dependence of the data 



on covariates, as for example in 



Dunson &: ParkI (|2006l ). A second advantage is that in 



principle it allows inference for the latent random measure P. 

The implementation of the conditional method poses interesting methodological chal- 
lenges, since it requires the im putation of the infinite - dimen sional vectors p and Z. The 
modus operandi advocated in llshwaran &: Zarepoun (120001 ) is to approximate the vec- 



tors, and thus the corresponding Dirichlet process prior, using some kind of truncation, 
such as p^^'^ = {pi, . . . ,pn) and Z^'^^ = {Zi, . . . , Z^), where determines the degree of 
approximation. Although in some cases it is possib l e to control the error pro duced by 



such truncations (jlshwaran &: Zarepour 



20001 . 



2002 



Ishwaran &: Jamesi . i2001i ) it would 



be desirable to avoid approximations altogether. 

This paper introduces two implementations of the conditional method which avoid 
any approximation. The proposed Markov chain Monte Carlo algorithms are very easily 
implemented and can readily be extended to more general stick-breaking models. The 
implementation of the conditional method is achieved by retrospective sampling, which 
is introduced in this paper. This is a novel technique which facilitates exact simulation 
in finite time in problems which involve infinite-dimensional processes. We also show 
how to use retrospective sampling in conjunction with our Markov chain Monte Carlo 
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algorithms in order to sample from the posterior distribution of functional of P. 

We identify a computational problem with the conditional method. As a result of 
the weak identifiability imposed on the labels of the mixture components, the posterior 
distribution of the random measure (p, Z) is multimodal. Therefore, the Gibbs sam- 
pler has to visit all these different modes. This problem is not eliminated with large 
datasets, since although the secondary modes become smaller, the energy gap between 
the modes becomes bigger, and thus the sampler can get trapped in low-probability ar- 
eas of the state-space of (p, Z). We design tailored label-switching moves which improve 
significantly the performance of our Markov chain Monte Carlo algorithm. 

We also contrast our retrospective Markov chain Monte Carlo algorithm with state- 
of-the-art implementations of the marginal method for non-conjugate models in terms 
of their Monte Carlo efficie ncy. In particular we cons ider the no -gaps algorithm of 



Neal 



(|200d ) . This comparison 



MacEachern Sz Miilleii (|l998l ) and Algorithms 7 and 8 of 
sheds light on the relative merits of the marginal and conditional method in models 
where they can both be applied. We find that the marginal methods slightly outperform 
the conditional. 

Since the original submission of this paper, the retrospective sampling ideas we intro- 
duce he re have been found very useful i n ext ensions of the Dirichlet process hierarchical 



model ( Dunson &: Park 



2006 



Griffin 



processes; see for example IBeskos et al, 



2006). and in the exact simulation of diffusion 



(l2006bl ). 



2. Retrospective sampling from the Dirichlet process prior 

Consider the task of simulating a sample X = {Xi, . . . from the Dirichlet process 
prior (l2|). Such a sample has the property that Xi \ P ^ P, Xi is independent of Xi 
conditionally on P for each / 7^ i, i,l = l,...,n, and P is the Dirichlet process with 
concentration parameter a and base measure Hq. This section introduces in a simple 
context the technique of retrospective sampling and the two different computational 
approaches, marginal and conditional, to inference for Dirichlet processes. There are 
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essentially two different ways of obtaining X. 

The sample X can be simulated directly from its marginal distribution. When P is 
marginalised the joint distribution of the X,;s is known an d it is described by a Polya 



urn scheme (jBlackwell &: MacQueen 



19731 : 



Fergusonl . 



19731 ). In particular, let Ki be an 



allocation variable associated with Xi, where Ki = j if and only if Xi = Zj. There- 
fore, the Ki's decide which component of the infinite series ([2]) Xi is associated with. 
Conditionally on P the i^j's are independent, with 

pr{Ki = j \ p) = pj, for alH = 1,. . . ,n, j = 1,2,. . . . (3) 

The marginal prior of = (i^i, . . . , Kn) is obtained by the following Polya urn scheme: 
pr{Ki = j \ Ki,..., Ki^i) = - — , '.'-^ ^, , iiKi= j for some I <i, 

Oi 

w{Ki / for alU < i I Ki, . . . , = - — — , 

[a + I — \) 

where njj denotes the size of the set {/ < i : Ki = j}. Thus, the probability that 
the iih sample is associated with the jth component is proportional to the number of 
samples already associated with j, whereas with probability a/ {a+i — 1) the ith sampled 
value is associated with a new component. Note that, whereas in ([3]) the labels j are 
identifiable, in (jH) the labels of the clusters are totally arbitrary. Simulation of X from 
its marginal distribution proceeds in following way. Set Ki = 1, although any other 
label could be chosen; simulate ~ Hq; and set Xi = (jii. For i > 1, let c denote 
the number of existing clusters; simulate Ki conditionally on Ki, . . . , according to 
the probabilities ([1]), where j = 1, . . . ,c; if Ki = j then set Xi = (pj, and otherwise 
set c = c + 1, simulate (pc ~ Hq independently of any previously drawn values, and 
set Xi = (j)c. At the end of this procedure 4>f = ('/'ii •••)</*£) are the parameters of the 
mixture components which are associated with at least one data point. The indexing 
1, . . . , A; is arbitrary and it is not feasible to map these (pfs to the Zj^s in the definition 
of the Dirichlet process in 

Alternatively, X can be simulated following a two-step hierarchical procedure. Ini- 
tially, a realisation of P is simulated, and then we simulate independently n allocation 



variables according to ([3]) and we set Xi = Zk^- However, simulation of P entails the 
generation of the infinite vectors p and Z, which is infeasible. Nevertheless, this problem 
can be avoided with the following retrospective simulation. The standard method for 
simulating from the discrete distribution defined in ^ is first to simulate Ui from a 
uniform distribution on (0, 1), and then to set Ki = j if and only if 

j-i i 

Y,Pi<Ui<^pi , (5) 
1=0 1=1 

where we define po = 0; this is the inverse cumulative distribution function method for 
discrete random variables; see for example Ripley (1987, §3.3). Retrospective sampling 
simply exchanges the order of simulation between Ui and the pairs {pj, Zj). Rather than 
simulating first {p,Z) and then Ui in order to check ([5]), we first simulate the decision 
variable Ui and then pairs {pj, Zj). If for a given Ui we need more pj's than we currently 
have in order to check ([5]), then we go back and simulate pairs {pj, Zj) 'retrospectively', 
until ([5]) is satisfied. The algorithm proceeds as follows. 

ALGORITHM 1. Retrospective sampling from the Dirichlet process prior 
Step 1. Simulate pi and Z\, and set N* = 1, i = 1 and pQ = 0. 
Step 2. Repeat the following until i > n. 
Step 2.1. Simulate Ui ~ f/n[0, 1] 

Step 2.2. // 1^ is true for some k < N* then set Ki = k, Xi = Zk, for i = i + 1, and 
go to Step 2. 

Step 2.3. // ([^ is not true for any k < N* then set N* = N* + 1, j = N* , simulate pj 
and Zj, and go to Step 2.2. 



In this notation, A^* keeps track of how far into the infinite sequence {{pj,Zj), j = 
1,2, . . .} we have visited during the simulation of the Aj's. Note that the retrospective 
sampling can be easily implemented because of the Markovian str ucture of th e pj's and 



the independence of the Zj^s. A similar scheme was advocated in 



DossI (119941 ). 



The previous simulation scheme illustrates the main principle behind retrospective 
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sampling: although it is impossible to simulate an infinite-dimensional random object we 
might still be able to take decisions which depend on such objects exactly avoiding any 
approximations. The success of the approach will depend on whether or not the decision 
problem can be formulated in a way that involves finite-dimensional summaries, possibly 
of random dimension, of the random object. In the previous toy example we formulated 
the problem of simulating draws from the Dirichlet process prior as the problem of com- 
paring a uniform random variable with partial sums of the Dirichlet random measure. 
This facilitated the simulation of X in finite time avoiding approximation errors. In gen- 
eral, the retrospective simulation scheme will require at the first stage simulation of both 
the decision variable and certain finite-dimensional summaries of the infinite-dimensional 
random object. Thus, at the second stage we will need to simulate retrospectively from 
the distribution of the random object conditionally on these summaries. This conditional 
simulation will typically be much more elaborate than the illustration we gave here. 

We shall see in J}3]that these ideas extend to posterior simulation in a computationally 
feasible way. However, the details of the method for posterior simulation are far more 
complicated than the simple method given above. 



3. The retrospective conditional method for posterior simulation 

3T Posterior inference 

When we fit ([T]) to data Y = {Yi^ . . . , Yn), there are a number of quantities about which 
we may want to make posterior inference. These include the classification variables Ki, 
which can be used to classify the data into clusters, the number of clusters in the popula- 
tion, the cluster parameters {Zj : Ki = j for at least one i} and the cluster probabilities 
{pj : Ki = j for at least one i}, the hyperparameters a, and A, the predictive distribu- 
tion of future data, and the random measure P itself. None of the existing methods can 
provide samples from the post erior distribution o f P w ithout resorting to some kind of 
approximation; see for example I Gelfand &: Kottaa (120021 ) for some suggestions. However, 



exact posterior simulation of finite-dimensional distributions and functionals of P might 
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be feasible; see ^3-6[ 

Markov chain Monte Carlo techniques have been developed for sampling- based explo- 



ratio n of the posterior distributions outlined above. The conditional method (llshwaran &: Zarepouii . 



2000) is based on an augmentation scheme in which the random probabilities p = 
{pi,P2, ■ . ■) and the component parameters Z = {Zi, Z2, ■ ■ ■) are imputed and the Gibbs 
sampler is used to sample from the joint posterior distribution of {K,p,Z) according 
to the full-conditional distributions of the variables. Ho wever, the fact that y and Z 



Ishwaran &: Zarepouii (j200d ) 



have countably infinite elements poses a major challenge 
advocated an approximation of these vectors, and therefore an approximation of the 
corresponding Dirichlet process prior. For instance a truncation of the Sethuraman rep- 
resentation ([2]) could be adopted, e.g. p^^'^ = (pi, . . . ,pn) and Z^^^ = {Zi, . . . , Z^), 
where determines the degree of approximation. 

In this section we show how to avoid any such approximation. The first step of 
our solution is to parameterise in terms of {K,V,Z), where V = {Vi,V2, ■ ■ ■) are the 
beta random variables used in the stick-breaking construction of p. We will construct 
algorithms which effectively can return samples from the posterior distribution of these 
vectors, by simulating iteratively from the full conditional posterior distributions of K, V 
and Z. Note that we can replace direct simulation from the conditional distributions with 
any updating mechanism, such as a Metropolis-Hastings step, which is invariant with 
respect to these conditional distribution. The stationary distribution of this more general 
componentwise-updating sampler is still the joint posterior distribution of {K, V, Z). 

We now introduce some notation. For a given configuration of the classification 
variables K = {Ki, . . . , Kn), we define 

n 

"^3 = ^MK,=j}^ i = i,2,..., 

i=l 

to be the number of data points allocated to the jth. component of the mixture. Moreover, 
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again for a given configuration of K let 

I = {1,2,...} 

j(al) ^ {j el : mj > 0} 

Therefore, I represents all components in the infinite mixture, I'^^ is the set of all 'alive' 
components and I^'^^ the set of 'dead' components, where we call a component 'alive' if 
some data have been allocated to it. The corresponding partition of Z and V will be 
denoted by Z^'i), and V^'^l 

The implementation of our Markov chain Monte Carlo algorithms relies on the results 
contained in the following Proposition, which describes the full conditional posterior 
distributions of Z, V and K. 

Proposition 1. Let 7r(z \ Q) be the density of HQ{dz). Conditionally on (Y,K,Q,X), 
Z is independent of (V, a) and it consists of conditionally independent elements with 

He, for all j G 

Zj I y, e, A ~ < 

UrK.=j f{Yi I Z„ X)7r{Zj I 9) for all j G /('^D , 
Conditionally on {K, a), V is independent of {Y, Z, Q, A) and it consists of conditionally 
independent elements with 

Vj \ K,a Be + 1, n — mi + for all j = 1,2, . . . . 

Conditionally on (Y,V, Z, X), K is independent of {Q,a) and it consist of conditionally 
independent elements with 

pr{Ki =j\Y, V, Z, X} oc pjf{Yi \ Z,,X), j = 1,2, . . . . 



The proof of the Proposition follows directly from the conditional independence struc- 
ture in the model and t he stick-breaking representation of the Dirichlet process; see 



Ishwaran &: JamesI (|2003l ) for a proof in a more general context. 
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The conditional independence structure in the model effects the simultaneous sam- 
pling of any finite subset of pairs {Vj, Zj) according to their full conditional distributions. 
Simulation of {Zj, Vj) when j G /('^^ is trivial. When dealing with non-conjugate models, 
the distribution of Zj for j € I^^^^ will typically not belong to a known family and a 
Metropolis-Hastings step might be used instead to carry out this simulation. Note that 
samples from the full conditional posterior distribution of pj's are obtained using samples 
of {Vh,h < j) and the stick-breaking representation in ([T]). 

The Ki^s are conditionally independent given Z and V. However, the normalising 
constant of the full conditional probability mass function of each Ki is intractable: 

oo 

Ci = ^Pjf{Yi I Zj,X) . 
i=i 

The intractability stems from the fact that an infinite sum of random terms needs to be 
computed. At this stage one could resort to a finite approximation of the sums, but we 
wish to avoid this. The unavailability of the normalising constants renders simulation 
of the -fCj's highly nontrivial. Therefore, in order to construct a conditional Markov 
chain Monte Carlo algorithm we need to find ways of sampling from the conditional 
distribution of the Ki^s. 

3-2 A retrospective quasi-independence Metropolis-Hastings sampler for the allocation 

variables 

One simple and effective way of avoiding the computation of the normalising constants 
Ci is to replace direct simulation of the KiS with a Metropolis-Hastings step. Let k = 
(ki, . . . ,kn) denote a configuration of K = {Ki, . . . , Kn), and let 

max{A;} = max/cj 

i 

be the maximal element of the vector k. We assume that we have already obtained 
samples from the conditional posterior distribution of {{Vj,Zj) : j < maxj/c}} given 
K = k. Note that the distribution of (V,-, Zj) conditionally on Y and K = k \s simply 
the prior, Vj ~ Be(a, 1), Zj ~ Hq, for any j > max{A;}. 
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We will describe an updating scheme k ^ k* which is invariant with respect to 
the full conditional distribution of j y, Z, A. The scheme is a composition of n 
Metropolis-Hastings steps which update each of the KiS in turn. Let 

^ihj) — (^1) • • • ) ki-i,j, ki^i, . . . , kn) 

be the vector produced from k by substituting the ith element by j. When updating Ki, 
the sampler proposes to move from k to k{i,j), where the proposed j is generated from 
the probability mass function 



QiikJ) oc < 



PjfiYi I Zj, A), for j < max{/c} 



(6) 



Mi{k)pj, for j > max{A;} . 

The normalising constant of ([6]) is 

m ax { fc } / max { fc } 

Ci{k) = J2 Pjfi^i I Zj,X)+M,{k) 1 - Y Pj 
i=i V i=i 

which can be easily computed given {{pj, Zj) : j < max{fc}}. Note that, for j < max{A;}, 
qi{k,j) oc pr(i^i = j | Y,V,Z,X), while, for j > max{fc}, qi{k,j) cx pr(Ki = j \ V). 
Here Mi{k) is a user-specified parameter which controls the probability of proposing 
j > max{/c}, and its choice will be discussed in ^3-31 

According to this proposal distribution the ith data point is proposed to be re- 
allocated to one of the alive clusters j € I^^^^ with probability proportional to the 
conditional posterior probability pic{Ki = j \ Y,V, Z, X). Allocation of the ith data 
point to a new component can be accomplished in two ways: by proposing j € I^'^^ 
for j < max{A;} according to the conditional posterior probability mass function and 
by proposing j G I^'^^ for j > max{k} according to the prior probability mass func- 
tion. Therefore, a careful calculation yields that the Metropolis-Hastings acceptance 
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probability of the transition from K = k to K = k{i^j) is 



1, 



if J < niax{fc} and max{fc(i, j)} = niax{fc} 



ai{k,k{i,j)} 



= < 



mill < 1 



Ci(k)M{k{i,j)} 



if J < niax{fc} and max{fc(i, j)} < niax{fc} 



Ci{kii,j)}f(Y,\Z„.,X) 



V 



mm 



Ciik)fiY,\Zj,X) 
Ci{k{i,j)}M{k) 




if j > inaxjfc} 



Note that proposed re-allocations to a component j < max{k} are accepted with prob- 
ability 1 provided that max{k{i, j)} = max{A;}. If the proposed move is accepted, we 
set k = k{i,j) and proceed to the updating of i^j+i- The composition of all these steps 
yields the updating mechanism for K. 

Simulation from the proposal distribution is achieved by retrospective sampling. For 
each i = 1, . . . , n, we simulate Ui ~ Un[0, 1] and propose to set Ki = j, where j satisfies 



with qi{k, 0)=0. If ([7]) is not satisfied for any j < max{A;}, then we start checking the 
condition for j > max{A;} until it is satisfied. This will require the values of {Vi,Zi),l > 
max{A;}. If these values have not already been simulated in the previous steps of the 
algorithm, they are simulated retrospectively from their prior distribution when they 
become needed. 

We therefore have a retrospective Markov chain Monte Carlo algorithm for sampling 
from the joint posterior distribution of (K, V, Z), which is summarised below. 

ALGORITHM 2. Retrospective Markov chain Monte Carlo 
Give an initial allocation k = (ki, . . . , kn), and set N* = max{k} 
Step 1. Simulate Zj from its conditional posterior, j < max{A:}. 
Step 2.1 Simulate Vj from its conditional posterior, j < max{A;}. 
Step 2.2 Calculate pj = (1 - Vi) • • • (1 - Vj-i)Vj, j < max{k}. 
Step 3.1. Repeat the following until i > n. 




(7) 
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Step 3.2. Simulate Ui ~ f/n[0, 1] 

Step 3.3.1 // ^ is true for some j < N* then set Ki = j with probability ai{k, k{i,j)}, 
otherwise leave it unchanged. Set i = i + 1 and go to Step 3.1. 

Step 3.3.2 // (0) is not true for any j < N* , set N* = N* + 1, j = N* . Simulate {Vj, Zj) 
from the prior, set pj = (1 — Vi) •••(! — Vj-i)Vj and go to Step 3.3.1 
Step 3.4. Set N* = max{k} and go to Step 1. 

Note that both max{A;} and N* change during the updating of the Ki's, with A^* > 
max{A;}. Thus, the proposal distribution is adapting itself to improve approximation of 
the target distribution. At the early stages of the algorithm N* will be large, but as the 
cluster structure starts being identified by the data then N* will take much smaller values. 
Nevertheless, the adaptation of the algorithm does not violate the Markov property. It 
is recommended to update the -ftTj's in a random order, to avoid using systematically less 
efficient proposals for some of the variables. 

The algorithm we have constructed updates all the allocation variables K but only 
a random subset of (y,Z), to be precise {{Vj,Zj),j < N*}. However, pairs of the 
complementary set {{Vj, Zj),j > N*} can be simulated when and if they are needed 
retrospectively from the prior distribution. This simulation can be performed off-line 
after the completion of the algorithm. In this sense, our algorithm is capable of exploring 
the joint distribution of {K, V, Z). 

3-3 Accelerations of the main algorithm 

There are several simple modifications of the retrospective Markov chain Monte Carlo 
algorithm which can improve significantly its Monte Carlo efficiency. 

We first discuss the choice of the user-specified parameter Mi{k) in ([6]). This param- 
eter relates to the probability, p say, of proposing j > max{k}, where 

^ Cj(max{/c}) 

If it is desired to propose components j > max{fc} a specific proportion of the time 
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then the equation above can be solved for Mi{k). For e xamp l e, p = a/{n — 1) is the 



Neal 



(j200d ). We recommend 



probability of proposing new clusters in Algorithm 7 of 
a choice of Mi{k) which guarantees that the probability of proposing j > maxj/c} is 
greater than the prior probability assigned to the set {j : j > max{A;}}. Thus Mi{k) 
should satisfy 

max{fc} / max{fc} \ 

^ PjfiYi I Zj,cl))+Mi{k) 1 - ^^(^)- 
This inequality is satisfied by setting 

M,{k) = m&x{f{Yi I Zj,(^),j < max{fc}} . (8) 



Since ([6]) res embles an independence 



perspective. 



Mengersen &: Tweedie 



sampler for Ki, ([8]) is advisable from a theoretical 



19961 ) have shown that an independence sampler 
is geometrically ergodic if and only if the tails of the proposal distribution are heavier 
than the tails of the target distribution. The choice of Mi{k) according to ([H]) ensures 
that the tails of the proposal qi{k,j) are heavier than the tails of the target probability 
pr{Ki = j I Y,p, Z, A}. Note that when Mi{k) is chosen according to ([5]) then p is random. 
In simulation studies we have discovered that the distribution of p is very skewed and 
the choice according to ([H]) leads to a faster mixing algorithm than alternative schemes 
with fixed p. 

Another interesting possibility is to update Z and V after each update of the allo- 
cation variables. Before Step 3.2 of Algorithm 2 we simulate Z^'^^ from the prior and 
leave Z^^') unchanged. Moreover, we can synchronise N* and max{A;}. Theoretically, 
this is achieved by pretending to update {Vj,j > max{A;}} from the prior before Step 
3.2. In practice, the only adjustment to the existing algorithm is to set N* = max{k} 
before Step 3.2. These extra updates have the computational advantage of storing only 
and {Vj,j < maxjfc}}. In addition, simulations have verified that these extra up- 
dates improve significantly the mixing of the algorithm. Morover, one can allow more 
components j G /^'^^ to be proposed according to the posterior probability mass function 
by changing max{A;} in ^ to max{A;} -|- I, for some fixed integer /. In that case the 
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acceptance probability (|3-2p needs to be slightly modified, but we have not found gains 
from such adjustment. 

3-4 Multimodality and label- switching moves 

The most significant and crucial modification of the algorithm we have introduced is 
the addition of label-switching moves. These moves will have to be included in any 
implementation of the conditional method which updates the allocation variables one at 
a time. The augmentation of p in the conditional method makes the components in the 
infinite mixture ([1]) weakly identifiable, in the sense that E{pj} > E{pi\ for any j > /, 
but there is nonnegligible prior probability that pi > pj, in particular when |/ — j| is 
small. As a result the posterior distribution of (p, Z) exhibits multiple modes. In order 
to highlight this phenomenon we consider below a simplified scenario, but we refer to 
^ for an illustration in the context of posterior inference for a specific non-conjugate 
Dirichlet process hierarchical model. 

In particular, assume that we have actually observed a sample of size n from the 
Dirichlet process X = (Xi, . . . ,X„) using the notation of ^ This is the limiting case 
of ([1]) where the observation density f{y \ z, A) is a point mass at z. In this case we 
directly observe the allocation of the Xis into c, say, clusters, each of size n;, say, 
where X]l^=i ~ ^- common value of the XiS in each cluster / gives precisely the 
parameters 0/ of the corresponding cluster / = 1, . . . , c. However, there is still uncertainty 
regarding the component probabilities pj of the Dirichlet process which generated the 
sample, and the index of the component for which each cluster has been generated. Let 
Ki,l = 1, . . . ,c denote these indices for each of the clusters. We can construct a Gibbs 
sampler for exploring the posterior distribution oip and (i^i, . . . , Kc): this is a nontrivial 
simulation, and a variation of the retrospective Markov chain Monte Carlo algorithm has 
to be used. Figure [U shows the posterior densities of (pi,P2)P3) when n = 10 and c = 3 
with ni = 5, n2 = 4, 713 = 1, and when n = 100 and c = 3 with ni = 50, n2 = 40, 71,3 = 10. 
In both cases we took a = 1. 
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Note that the posterior distributions of the pjs exhibit multiple modes, because the 
labels in the mixture are only weakly identifiable. Note that the secondary modes become 
less prominent for larger samples, but the energy gap between the modes increases. In 
contrast, the probability of the largest component, maxjpj}, has a unimodal posterior 
density. This is shown in the right panel of Figure [D^c) ; simulation from this density has 
been achieved by retrospective sampling, see ^3-61 

In general, in the conditional method the Markov chain Monte Carlo algorithm has 
to explore multimodal posterior distributions as those shown in Figure [TJ Therefore, 
we need to add label-switching moves which assist the algorithm to jump across modes. 
This is particularly important for large datasets, where the modes are separated by ar- 
eas of negligible probability. A careful inspection of the problem suggests two types of 
move. The first proposes to change the labels j and / of two randomly chosen com- 
ponents j, I G I^^'-* . The probability of such a change is accepted with probability 
mm{l, {pj/pi)"^'^~^^}. This proposal has high acceptance probability if the two com- 
ponents have similar weights; the probability is indeed 1 if pj = pi. On the other hand, 
note that the probability of acceptance is small when \mi —mj\ is close to 0. The second 
move proposes to change the labels j and j ' + 1 of two neighbouring components but 
at the same time to exchange Vj with V^+i. This change is accepted with probability 
min{l, (1 — Vj+i)'"j /(I — Vj)"^^+^}. This proposal is very effective for swapping the labels 
of very unequal clusters. For example, it will always accept when mj = 0. On the other 
hand, if \pj —pj^i \ is small, then the proposal will be rejected with high probability. For 
example, if Vj — 1/2, V^+i — 1, and rrij — rrij^i, then the proposal attempts to allocate 
rrij is allocated to a component of negligible weight, Hti ~ ~ Thus, 
the two moves we have suggested are complementary to each other. 

Extensive simulation experimentation has revealed that such moves are crucial. The 
integrated autocorrelation time, see ^ for definition, of the updated variables can be 
reduced by as much as 50-90%. 



The problem of multimodality has also been addressed in a recent paper by 



Porteous et al 
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(i2ooa). 



3-5 Exact retrospective sampling of the allocation variables 

It is possible to avoid the Metropolis-Hastings step and simulate directly from the con- 
ditional posterior distribution of the allocation variables. This is facilitated by a general 
retrospective simulation scheme for sampling discrete random variables whose probability 
mass function depends on the Dirichlet random measure. 

We can formulate the problem as one of simulating a discrete random variable J, 
according to the distribution specified by the probabilities 

rr-= ^^^^ J = l,2,.... (9) 

The fj's are assumed to be positive and independent random variables, also independent 
of the p/s, which are given by the stick-breaking rule in ([T]), although more general 
schemes can be incorporated in our method. The retrospective sampling scheme of ^ 
cannot be applied here, since the normalising constant of ([9]), c = ^Pjfj, is unknown. 
In the specific application we have in mind, fj = fiYi \ Zj, A) for each allocation variable 
Ki we wish to update. 

Nevertheless, a retrospective scheme can be applied if we can construct two bounding 
sequences {q(A;)} | c and {c„(A;)} [ c, where ci{k) and c„(/c) can be calculated simply 
on the basis of (pi, Zi), . . . , (pfc, Zk)- Let ruj{k) = rj/ci{k) and rij(k) = rj/cu{k). Then 
we first simulate a uniform U and then we set J = j when 

i-i j 

X] ru,mik) <U<Y1 ri^rnik) ■ (10) 
m=l m=l 

In this algorithm k should be increased, and the pj's and /j's simulated retrospectively, 
until (llOp can be verified for some j <k. 

Therefore, simulation from unnormalised discrete probability distributions is feasible 
provided the bounding sequences can be appropriately constructed. It turns out that 
this construction is generally very challenging. In this paper we will only tackle the 
simple case in which there exists a constant M < oo such that fj < M for all j almost 
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surely. In our application this corresponds to f(y \ z,X) being bounded in z. In that 
case one can simply take 

k 

Cu{k) = ci{k) + M -Yj>i 

When limsup/j = oo almost surely, an alternative construction has to be devised, 
which involves an appropriate coupling of the /j's. This construction is elaborate and 
mathematically intricate, and will be reported elsewhere. 

3-6 Posterior inference for functionals of Dirichlet processes 
Suppose we are interested in estimating the posterior distribution of 

1= [ g{x)P{dx), 
Jx 

for some real-valued function where X denotes the space on which the component 
parameters Zj are defined and P is the Dirichlet random measure ([2]), given data Y from 
the hierarchical model ([l]). Our implementation of the conditional method shows that 
simulation from the posterior distribution of X is as difficult as the simulation from its 
prior. Let {(V^-, Zj),j < max{A;}} be a sample obtained using the retrospective Markov 
chain Monte Carlo algorithm. We assume that the sample is taken when the chain is 
'in stationarity', i.e. after a sufficient number of initial iterations. Given the Vj^s we can 
compute the corresponding , j < max{A;}. Recall that Zj ~ Hq and Vj ~ Be(l,a) for 
all j > max{A;}. Then we have the following representation for the posterior distribution 
of Z: a draw from I \ Y can be represented as 

max{fc} OO max{fc} j 

j = l j=max{fe} + l j = l I 11 = 1 

This is equality in distribution, where / is a draw from from the pri or distribution of X. 
Inference for linear functionals of the Dirichlet process was initiated in lCifarelli &: Regazzini 
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2001 



199C) and simulation aspec t s hav e been considered, for example in lGuglielmi &: Tweedie 



and 



Guglielmi et al 



(12002). 

Our algorithm can also be used for the simulation of non-linear functionals under 
the posterior distribution. As an illustrative example consider posterior simulation of 
the predominant species corresponding to J such that pj > pi for all / = 1, 2, . . .. This 
can be achieved as follows given a sample {(Vi, Zi),j < max{k}} obtained with any of 
our conditional Markov chain Monte Carlo algorithms. Let J = argmax|j<jv*}Pj, where 
N* = max{A;}. Then, it can be seen that if 1 — X^jLiPi < PJ then J is the predominant 
specie. Therefore, if 1 — Hat* > pj, we repeat the following procedure until the condition 
is satisfied: increase A^*, simulate pairs {Z^*, Vn*) from the prior, and compute J. This 
procedure was used to obtain the results in Figure HJc). 



3-7 Inference for hyperparameters 

A simple modification of the retrospective algorithms described above provides the com- 
putational machinery needed for Bayesian inference for the hyperparameters (0, a, A). If 
we assume that appropriate prior distributions have been chosen, the aim is to simulate 
the hyperparameters according to their full conditional distributions, thus adding one 
more step in the retrospective Markov chain Monte Carlo algorithm. 

Sampling of A according to its full conditional distribution is standard. At first glance, 
sampling of (0,a) poses a major complication. Note that, because of the hierarchical 
structure, (0,a) are independent of (Y^K) conditionally upon {V,Z). However, {V,Z) 
contains an infinite amount of information about {a, 0). Therefore, an algorithm which 
updated successively {V, Z, K) and (0, a) according to their full conditional distributions 
would be reducible. This type of convergence problem is not uncommon when Markov 



chain Monte Carlo is used to infer about 
processes; see 



Papaspiliopoulos et al 



lierarchical models with hidden stochastic 



(|200,1 



20061 ) for reviews. 



However, the conditional independence structure in Z and V can be used to cir- 
cumvent the problem. In effect, instead of updating (0, a) conditionally upon (Z, V) 
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we can jointly update {@,a, Z^'^\V^'^^) conditionally upon 

(^(ai)^y(ai))^ In practice, we 

only need to simulate {@,a) conditionally on {Z^^^\V^'^^'^). That poses no complication 
since (Z^^'-*, contains only finite amount of information about the hyperparameters. 

It is worth menti oning that a similar te c hniqu e for updating the hyperparameters was 



recommended by 
algorithm. 



MacEachern &: Miilleii (|l998l ) for the implementation of their no-gaps 



4. Comparison between marginal and conditional methods 

In this section we attempt a comparison in terms of Monte Carlo efficiency between the 
retrospective Markov chain Monte Carlo algorithm and state-of-the-art implementations 
of the marginal approach. To this end we have carried out a large simulation study 
part of which is presented later in this section. The methods are tested on the non- 
conjugate model where f{y | z) is a Gaussian density with z = {fi,a'^) and Hq is the 
product measure N[fi, cj^) x 10(7, /?); there are no further parameters A indexing f{y j z), 
G = (/u, cr^, 7, /3) and 10(7, /3) denotes the inverse Gamma distribution with density 
proportional to x~^"'^^^e~^^ . Note that, in a density - estirn ation context, this model 



allows for local smoothing; see for example 



Mulleretal 



(Il99fil ) and 



Green &: Richardso: 



mom . 

An e xcellent over view of different implementations of the marginal approach can be 
found in iNeall (|200d ) . The most succ essful implementations f or no n-conjugate models 



are the so-called no- gaps algorithm of 
and 8 of 



MacEachern fc Miillei] (jl998l ) and Algorithms 7 



Neall ([20001), which were introduced to mitigate against certain computational 
inefficiencies of the no-gaps algorithm. 

Received Markov chain Monte Carlo wisdom suggests that marginal samplers ought 
to be pre ferred to c onditional ones. Some limited theory supports this view; see in 



particular iLiul (Il994l ). However, it is common for marginalisation to destroy conditional 
independence structure which usually assists the conditional sampler, since conditionally 
independent components are effectively updated in one block. Thus, it is difficult a priori 
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to decide which approach is preferable. 

In our comparison we have considered different simulated datasets and different 
prior specifications. We have simulated 4 datasets from two models. The 'lepto 100' 
and the 'lepto 1000' datasets consist respectively of 100 and 1000 draws from the uni- 
modal leptokurtic mixture, 0.67iV(0, 1) + 0.33iV(0.3, 0.25^). The 'bimod 100' ('bimod 
1000') dataset consists of 100 (1000) draws from the bi modal mixture, 0.5A^(— 1, 0.5 ^) + 



Green Sz RichardsonI (j200ll ). In 



0.5A^(1, 0.S'^); we have chosen these datasets following 
our simulation we have taken the datasets of size 1 00 to be subsets of t 
1000. We fix in a data-driven way as suggested by I Green &: Richardson 
denotes the range of the data, then we set ^ = R/2,az = i?, 7 = 2 and (3 = 0.02R^. 
Data-depen dent choice of hyperparame ters is commonly made in mixture models, see 



rose of size 
2OO1I): ifi? 



Richardson &: GreenI (|l997l ). We consider three different values of a, 0.2, 1 



for example 

and 5. We use a Gibbs move to update Zj = {Zj^\ Zj^^) for every j G /(^'): we update 
Zj^'^ given Zj^^ and the rest, and then Zj^^ given the new value of zj^"* and the rest. The 
same scheme is used to update the corresponding cluster parameters in the marginal 
algorithms. 

We monitor the convergence of four functionals of the updated variables: the number 
of clusters, M, the deviance D of the estimated density, and Z^\ for i = 1,2 in 'lepto' 



and for i = 2, 3 in ' 



aimod'. These functionals 



m 



Neall pood ) and 



rave been used in the comparison studies 



Green &: RichardsonI (120011 ) to monitor algorithmic performance. In 



both cases, the components monitored were chosen to be ones whose allocations were 
particularly well- and badly-identified by the data. The deviance D is calculated as 
follows 

n 

jg/{al) 



see 



1=1 



Green Sz RichardsonI (120011 ) for details. Although we have given the expression in 
terms of the output of the conditional algorithm, a similar expression exists given the 
output of the marginal algorithms. The deviance is chosen as a meaningful function of 



several parameters of interest. 
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The efficiency of the sampler is summarised by reporting for each of the monitored 
variables the estimated integrated autocorrelation time, r = 1 + 2^°^j^/9j, where pj is 
the lag-j autocorrelation of the monitored chain. This is a standard way of meas uring the 



speed of convergen ce of square- integr able function s of a n ergo dic M arkov chain (IRoberts 



996 



2001 



Sokall . 



Neall mm and 



Green &: Richardson 



19971 ) which has also been used by 
in their simulation studies. Recall that the integrated autocorrelation time 
is proportional to the asymptotic variance of the ergodic average. In particular, if 
ri/r2 = 6 > 1, where is the integrated autocorrelation time of algorithm i for a specific 
functional, then Algorithm 1 requires roughly b times as many iterations to achieve the 
same Monte Carlo error as Algorithm 2, for the estimation of the specific functional. 
Estirnation of r i s a notoriously difficult problem. We have followed the guidelines in 
§3 of ISokall (|l997l ). We estimate r by summing estimated autocorrelations up to a fixed 



lag L, where t « L « N, and is the Monte Carlo sample s ize. Approx imate 



standard errors of the estimate can be obtained; see formula (3.19) of ISokall (119971 ). For 
the datasets and prior specifications we have considered we have found that A = 2 x 10^ 
suffices in order to assess the relative performance of the competing algorithms. 

The results of our comparison are reported in Tables d] - El We contrast our ret- 



rospective Markov chain Monte Carlo algorit hm with three margin a 



improved version of the no-gaps algorithm of 



MacEachern &: Miilleii (119981 1. where we 



algo rithms: an 



upda t ed de ad-cluster parameters after each allocation variable update. Algorithm 7 of 



Neal 



modi ), and Algorithm 8 of 



Neal 



(j2000l ). where we use three auxiliary states. The 



results show that Algorithms 7 and 8 perform better than the competitors, although 
the difference among the algorithms is moderate. In this comparison we have not taken 
into account the computing times of the different methods. Our implementation, which 
however did not aim at optimizing computational time, in FORTRAN 77 suggests that 
no-gaps. Algorithm 7 and the retrospective Markov chain Monte Carlo algorithm all 
have roughly similar computing times when a = 1. Algorithm 8 is more intensive than 
Algorithm 7. The computing time of the retrospective algorithm increases with the value 
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of a. 

Careful inspection of the output of the algorithms has suggested a possible reason 
why the conditional approach is outperformed by the marginal approaches. This is 
because it has to explore multiple modes in the posterior distribution of the random 
measure {j)^Z\^ see for example Figure [2] for results concerning the 'bimod' dataset. On 
the other hand, the ambiguity in the cluster labelling is not important in the marginal 
approaches, which work with the unidentifiable allocation structure and do not need to 
explore a multimodal distribution. This indicates that the marginal approaches achieve 
generally smaller integrated autocorrelation times compared to the conditional approach. 
Nevertheless, the label-switching moves we have included have substantially improved 
the performance of our algorithm. 

5. Discussion 

The appeal of the conditional approach lies in its potential for inferring for the latent 
random measure, which we have illustrated, and in its flexibility to be extended to more 
general stick-breaking random measures than the Dirichlet process. With respect to the 
latter, we have not explicitly shown how to extend our methods to more general models, 
but it should be obvious that such extensions are direct. In particular, Proposition 1 will 
have to be adapted accordingly, but all the crucial conditional independence structure 
which allows retrospective sampling will be present in the more general contexts. 

In extending this work, we have already discussed in §3-31 an exact Gibbs sampler, 
i.e. one in which the allocation variables are simulated directly from their conditional 
posterior distributions. If the likelihood function is unbounded, this has to be carried 
out by an intricate coupling of the Dirichlet process which permits tight bounds on 
the normalising constants Cj and also allows retrospective simulation of all related vari- 
ables. Although implementation of the resulting algorithm is simple to implement, the 
mathematical construction behind this method is very cimplicated and will be reported 
elsewhere. 
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Retrospective sampling is a methodology with great potential for other problems 
involving simulation and inference for stochastic processes. One major application which 
has emerged since the completion of this re s earch , is the exact simulation and estimation 



of diffusion processes (jBeskos et al. 



2006al . 



20051 . 



2006b|) 



Acknowledgement 

We are grateful to Igor Pruenster for several constructive comments. Moreover, we 
would like to thank Stephen Walker, Radford Neal, Peter Green, Steven MacEachern, 
two anonymous referees and the editor for valuable suggestions. 

References 

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to bayesian 
nonparametric problems. Ann. Statist. 2, 1152-74. 

Beskos, a., Papaspiliopoulos, O. & Roberts, G. O. (2005). A new factorisation 
of diffusion measure and finite sample path constructions. To appear in Methodology 
and Computing in Applied Probability. 

Beskos, A., Papaspiliopoulos, O. & Roberts, G. O. (2006a). Retrospective exact 
simulation of diffusion sample paths with applications. Bernoulli 12, 1077-98. 

Beskos, A., Papaspiliopoulos, O., Roberts, G. O. & Fearnhead, P. (2006b). 
Exact and efficient likelihood-based inference for discretely observed diffusions (with 
Discussion). J. Roy. Statist. Soc. B 68, 333-82. 

Blackwell, D. & MacQueen, J. B. (1973). Ferguson distributions via Polya urn 
schemes. Ann. Statist. 1, 353-5. 

Burr, D. & Doss, H. (2005). A Bayesian semiparametric model for random-effects 
meta-analysis. J. Am. Statist. Assoc. 100, 242-51. 



25 



ClFARELLi, D. M. & Regazzini, E. (1990). Distribution functions of means of a 
Dirichlet process. Ann. Statist. 18, 429-42. 

Doss, H. (1994). Bayesian nonparametric estimation for incomplete data via successive 
substitution sampling. Ann. Statist. 22, 1763-86. 

DuNSON, D. & Park, J. (2006). Kernel stick-breaking processes, submitted, available 
from http:/ / ftp. stat. duke. edu/WorkingPapers/ 06-22.pdf. 

Fearnhead, p. (2004). Particle filters for mixture models with an unknown number of 
components. Statist. Comp. 14, 11-21. 

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. 
Statist. 1, 209-30. 

Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. Ann. 
Statist. 2, 615-29. 

Ferguson, T. S. (1983). Bayesian density estimation by mixtures of normal distri- 
butions. In H. Chernoff, M. Haseeb Rizvi, J. Rustagi &; D. Siegmund, eds.. Recent 
Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday. 
New York: Academic Press, pp. 287-302. 

Gelfand, a. & KoTTAS, A. (2003). Bayesian semiparametric regression for median 
residual life. Scand. J. Statist. 30, 651-65. 

Gelfand, A. E. & Kottas, A. (2002). A computational approach for full nonpara- 
metric Bayesian inference under Dirichlet process mixture models. J. Comp. Craph. 
Statist. 11, 289-305. 

Green, P. & Richardson, S. (2001). Modelling heterogeneity with and without the 
Dirichlet process. Scand. J. Statist. 28 355-75. 



26 



Griffin, J. (2006). On the Bayesian analysis of species sampling 

mixture models for density estimation. submitted, available from 

http://www2.warwickMC.uk/Jac/sci/statistics/staff/academic/griffin/personal/densityestimation.pdf. 

GuGLiELMi, A., Holmes, C. C. & Walker, S. G. (2002). Perfect simulation involving 
functionals of a Dirichlet process. J. Comp. Graph. Statist. 11, 306-10. 

GUGLIELMI, A. Sz TWEEDIE, R. L. (2001). Markov chain Monte Carlo estimation of 

the law of the mean of a Dirichlet process. Bernoulli 7, 573-92. 

ISHWARAN, H. & James, L. (2001). Gibbs sampling methods for stick-breaking priors. 
J. Am. Statist. Assoc. 96, 161-73. 

ISHWARAN, H. & James, L. F. (2003). Some further developments for stick-breaking 
priors: finite and infinite clustering and classification. Sankhyd, A, 65, 577-92. 

ISHWARAN, H. & Zarepour, M. (2000). Markov chain Monte Carlo in approximate 
Dirichlet and beta two-parameter process hierarchical models. Biometrika 87, 371-90. 

ISHWARAN, H. & Zarepour, M. (2002). Exact and approximate sum-representations 
for the dirichlet process. Can. J. Statist. 30, 269-83. 

Jain, S. & Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure 
for the Dirichlet process mixture model. J. Comp. Graph. Statist. 13, 158-82. 

Liu, J. S. (1994). The collapsed Gibbs sampler in Bayesian computations with applica- 
tions to a gene regulation problem. J. Am. Statist. Assoc 89, 958-66. 

Liu, J. S. (1996). Nonparametric hierarchical Bayes via sequential imputations. Ann. 
Statist. 24, 911-30. 

Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates. I. Density estimates. 
Ann. Statist. 12, 351-7. 

MacEachern, S. & MULLER, P. (1998). Estimating mixture of Dirichlet process 
models. J. Comp. Graph. Statist. 7, 223-38. 

27 



Mengersen, K. L. &; TWEEDIE, R. L. (1996). Rates of convergence of the Hastings 
and Metropolis algorithms. Ann. Statist. 24 101-21. 

MULLER, P., Erkanli, A. & WEST, M. (1996). Bayesian curve fitting using multivari- 
ate normal mixtures. Biometrika 83, 67-79. 

MuLLER, P., ROSNER, G. L., De Iorio, M. & MacEachern, S. (2005). A nonpara- 
metric Bayesian model for inference in related longitudinal studies. Appl. Statist. 54, 
611-26. 

Neal, R. (2000). Markov chain sampling: Methods for Dirichlet process mixture models. 
J. Comp. Graph. Statist. 9, 283-97. 

Papaspiliopoulos, O., Roberts, G. O. & Skold, M. (2003). Non-centered parame- 
terisations for hierarchical models and data augmentation. In J. Bernardo, M. Bayarri, 
J. Berger, A. Dawid, D. Heckerman, A. Smith & M. West, eds., Bayesian Statistics 7. 
Oxford: Oxford University Press, pp. 307-27. 

Papaspiliopoulos, O., Roberts, G. O. & Skold, M. (2006). A general framework 
for parametrisation of hierarchical models, to appear in Statist. Sci. . 

Porteous, I., Ihter, a., Smyth, P. & Welling, M. (2006). Gibbs sampling for 
(coupled) infinite mixture models in the stick breaking representation. In Proceedings 
of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06). 
Arlington, Virginia: AUAI Press. 

QuiNTANA, F. & IGLESIAS, P. (2003). Bayesian clustering and product partition models. 
J. Roy. Statist. Soc. B 65, 557-574. 

Richardson, S. & Green, P. J. (1997). On Bayesian analysis of mixtures with an 
unknown number of components (with Discussion). J. R. Statist. Soc. B 59, 731-92. 

Ripley, B. D. (1987). Stochastic Simulation. Chichester: Wiley. 



28 



Roberts, G. O. (1996). Markov chain concepts related to sampling algortihms. In 
W. Gilks, S. Richardson & D. Spiegelhalter, eds., MCMC in Practice. London: Chap- 
man and Hall, pp. 45-57. 

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica 4, 
639-50. 

SOKAL, A. (1997). Monte carlo methods in statistical mechanics: foundations and new 
algorithms. In Functional Integration (Cargese, 1996), vol. 361, of NATO Adv. Set. 
Inst. Ser. B Phys. New York: Plenum, pp. 131-92. 

Teh, Y., Jordan, M., Beal, M. & Blei, D. (2006). Hierarchical 
dirichlet processes. to appear in J. Amer. Statist. Assoc., available from 
http://www. cs.princeton. edu/ blei/papers/TehJordanBealBlei2006.pdf. 

(a) (b) (c) 




Figure 1: Posterior densities of pi (solid), p2 (dashed) and ps (dashed with diamonds), 
corresponding to (a) a dataset with n = 10 separated into three clusters of sizes ni = 
5, n2 = 4, ni and (b) a dataset with n = 100, ni = 50, n2 = 40 and 113 = 10. (c) shows 
the posterior density of maxjjpj} for n = 10 (dashed) and n = 100 (solid). 
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M D Zk, Zk^ 

Method a = 1 

Retrospective 41.42 (2.6) 3.28 (0.21) 3.9 (0.25) 3.44 (0.22) 

No-gaps 45.94 (1.52) 3.84 (0.13) 3.66 (0.12) 2.82 (0.09) 

Algorithm 7 21.85 (0.77) 2.48 (0.09) 3.47 (0.12) 2.91 (0.10) 

Algorithm 8 18.21 (0.66) 2.94 (0.11) 3.44 (0.13) 3.1 (0.12) 

a = 0.2 

Retrospective 67.0 (4.24) 6.8 (0.43) 8.01 (0.51) 3.44 (0.22) 

No-gaps 39.44 (1.31) 3.8 (0.13) 5.93 (0.2) 3.1 (0.10) 

Algorithm 7 24.99 (0.88) 2.87 (0.10) 6.32 (0.22) 2.95 (0.10) 

Algorithm 8 22.10 (0.85) 5.30 (0.20) 6.8 (0.26) 3.03 (0.12) 

a = 5 

Retrospective 21.86 (1.38) 2.82 (0.18) 2.01 (0.13) 2.5 (0.16) 

No-gaps 57.09 (1.81) 2.99 (0.09) 1.67 (0.05) 2.01 (0.06) 

Algorithm 7 12.55 (0.4) 1.77 (0.06) 1.64 (0.05) 1.97 (0.06) 

Algorithm 8 8.2 (0.26) 1.77 (0.06) 1.64 (0.05) 1.97 (0.06) 

Table 1: Estimated integrated autocorrelation times for the number of clusters M, the 
deviance D, Zk^ and Zk^, for the 'bimod 100' dataset. Estimates of the standard error 
in parenthesis. The initial state of all chains was all data allocated to the same cluster 
with parameters drawn from the prior. 
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M D Zk, Zk^ 

Method a = 1 

Retrospective 40.71 (2.58) 31.99 (2.01) 46.58 (2.95) 3.04 (0.19) 

No-gaps 46.08 (1.46) 23.93 (0.76) 33.19 (1.05) 2.37 (0.07) 

Algorithm 7 22.98 (0.73) 20.17 (0.64) 28.28 (0.89) 2.33 (0.07) 

Algorithm 8 18.02 (0.57) 18.91 (0.6) 26.71 (0.85) 2.06 (0.07) 

a = 0.2 

Retrospective 239.07 (15.12) 286.49 (18.12) 157.85 (9.99) 14.87 (0.94) 

No-gaps 127.08 (6.96) 151.90 (8.32) 97.73 (5.35) 7.46 (0.41) 

Algorithm 7 109.37 (5.99) 171.98 (9.42) 86.26 (4.73) 7.95 (0.44) 

Algorithm 8 99.06 (5.43) 142.93 (7.83) 82.38 (4.51) 6.98 (0.38) 

a = 5 

Retrospective 13.69 (0.87) 7.38 (0.47) 5.9 (0.37) 1.61 (0.1) 

No-gaps 44.25 (1.4) 5.72 (0.18) 4.14 (0.13) 1.36 (0.04) 

Algorithm 7 10.57 (0.33) 5.55 (0.18) 3.52 (0.11) 1.33 (0.04) 

Algorithm 8 6.32 (0.2) 5.31 (0.17) 3.23 (0.10) 1.29 (0.04) 

Table 2: Estimated integrated autocorrelation times for the number of clusters M, the 
deviance D, Zk^ and Zk^, for the 'lepto 100' dataset. Estimates of the standard errors 
in parenthesis. The initial state of all chains was all data allocated to the same cluster 
with parameters drawn from the prior. 

'bimod 1000', a = 1 'lepto 1000', a = 1 
M DM 

Retrospective 149 (7) 254 (25) 205 (21) 

No-gaps 91 (4) 133 (6) 102 (5) 

Algorithm 7 60 (3) 87 (4) 99 (4) 

Algorithm 8 58 (3) 112 (5) 104 (5) 

Table 3: Estimated integrated autocorrelation times for the 'bimod 1000' and 'lepto 
1000' datasets. Ther results for D in the 'bimod 1000' data set, Zk'^-,Zk2 and Zk^, for 
both datasets were not markedly different across the algorithms so are omitted. 
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Figure 2: Posterior densities of Z\ (left), Z2 (middle) and Z3 (right) for the 'bimod 100' 
(top) and the 'bimod 1000' (bottom) datasets. All results have been obtained for a = 1. 
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